On-Line Probability, Complexity and Randomness

Authors

  • Alexey V. Chernov
  • Alexander Shen
  • Nikolai K. Vereshchagin
  • Vladimir Vovk
Abstract

Classical probability theory considers probability distributions that assign probabilities to all events (at least in the finite case). However, there are natural situations where only part of the process is controlled by some probability distribution, while for the other part we know only the set of possibilities, without any probabilities assigned. We adapt the notions of algorithmic information theory (complexity, algorithmic randomness, martingales, a priori probability) to this framework and show that many classical results are still valid.

1 On-line probability distributions

Consider the following "real-life" situation. There is a tournament (say, chess or football); before each game the referee tosses a coin to decide which player will start the next game. Assuming the referee is honest, we would be surprised to learn that, say, all 100 coin tosses produced tails. We would also be surprised if the result of the coin tossing always turned out to be equal to some (simple) function of the results of the previous games. However, it is quite possible that the results of coin tossing can be easily computed from the results of subsequent games: it may well happen that the coin bit influences the results of the subsequent games and therefore can be reconstructed once these results are known.

Another similar example: if there were a rule that predicts the lucky numbers in a lottery from the previous day's newspaper, we would not trust the lottery organizers. However, for the next day's newspaper the situation is different (e.g., the newspaper may publish the results of the lottery).

Let X_i be the information string available before the start of the i-th game (say, the text of the newspaper printed just before the game), and let the bit b_i be the result of the coin tossing at the start of the i-th game. We would like to say that for every function f and for every i the probability of the event b_i = f(X_i) is 1/2, assuming the referee is honest, and that for N games the probability of the event ∀i (b_i = f(X_i)) equals 2^{-N}. However, we cannot directly use the framework of classical probability theory in this example. Indeed, when speaking about the probability of some event, one usually assumes that some probability distribution is fixed, and this distribution assigns probabilities to all possible events (at least in the finite case). In our example we do not have a probability distribution for the X_i; the only thing we have is the "conditional probability" of the event b_i = 1 for any condition X_1, b_1, ..., X_{i-1}, b_{i-1}, X_i, and this conditional probability equals 1/2.

[Fig. 1. The tree of possibilities: X_1: no distribution; b_1: uniform distribution; X_2: no distribution; b_2: uniform distribution; ...]

Formally speaking, we get a "tree of possibilities" (Fig. 1). The sons of the root are the possible values of X_1. Each of them has two sons that correspond to the two possible outcomes of the first coin tossing (b_1 = 0 or 1). The next level of branching corresponds to the values of X_2, then each vertex again has two sons (b_2 = 0 or 1), etc. In other words, the tree vertices are finite sequences (X_1, b_1, ..., X_k, b_k) for even layers and (X_1, b_1, ..., X_k, b_k, X_{k+1}) for odd layers; the X_i are binary strings and the b_i are bits. We may consider a finite tree with 2N layers whose leaves are the sequences (X_1, b_1, ..., X_N, b_N), or an infinite tree whose vertices are sequences of any length. What we have is not a probability distribution but something that can be called an on-line probability distribution on this tree.
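Returning to the tournament example, the quantitative claim above can be checked by a small simulation. The following Python sketch is ours, not from the paper; for simplicity the "newspaper" X_i is modelled as a string that reports the previous coin bit, so each b_i leaks into X_{i+1} and can be reconstructed afterwards, yet any predictor that sees only X_i guesses all N bits with probability 2^{-N}.

```python
import random

def play(n_rounds, predict, rng=random):
    """Simulate the tournament: before round i the 'newspaper' X_i is revealed
    (here it simply reports the previous coin bit), then the referee tosses a
    fair coin b_i.  Returns True if predict(X_i) == b_i for every i."""
    history_bit = 0
    for _ in range(n_rounds):
        x = str(history_bit)            # X_i: information available before round i
        b = rng.getrandbits(1)          # the referee's fair coin
        if predict(x) != b:
            return False
        history_bit = b                 # the next newspaper will report b_i
    return True

trials, n = 100_000, 5
# A predictor that uses only X_i succeeds in all 5 rounds with probability 2^-5,
# even though each b_i can be read off from the *next* newspaper X_{i+1}.
hits = sum(play(n, lambda x: int(x[0])) for _ in range(trials))
print(hits / trials)                    # approximately 1/32 = 0.03125
```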
By definition, to specify an on-line probability distribution one must fix, for each i and for all values of X_1, b_1, ..., X_i, two non-negative reals with sum 1. They are called the conditional probabilities of 0 and 1 after X_1, b_1, ..., X_i and are denoted by Pr[b_i = 0 | X_1, b_1, ..., X_{i-1}, b_{i-1}, X_i] and Pr[b_i = 1 | X_1, b_1, ..., X_{i-1}, b_{i-1}, X_i]. For the case of a fair coin all these conditional probabilities are equal to 1/2.

As usual, we can switch to unconditional probabilities (i.e., multiply the conditional probabilities along the path from the tree root). Then we arrive at the following version of the definition: an on-line probability distribution is a function P defined on tree vertices such that P(Λ) = 1 (Λ is the tree root), P(X_1, b_1, ..., X_i, b_i, X_{i+1}) = P(X_1, b_1, ..., X_i, b_i) (on vertices where no random choice is made, the function propagates without change), and

P(X_1, b_1, ..., X_i, b_i, X_{i+1}) = P(X_1, b_1, ..., X_i, b_i, X_{i+1}, 0) + P(X_1, b_1, ..., X_i, b_i, X_{i+1}, 1).

The intuitive meaning of P(v) is the probability to arrive at v if the environment (which chooses X_1, X_2, ...) wants this and makes suitable moves on its turns. This definition makes sense both for finite and infinite trees.

Remark. A technical problem arises when some values of an on-line probability distribution are zeros: in this case the conditional probabilities cannot be reconstructed from the products. However, in this case they are usually not important, so we can mostly ignore this problem.

Similar on-line probability distributions can be considered for more general trees where on the odd levels, instead of 0 and 1, we have a (countable) list of possible values of b_i.

Now let us assume that the tree is finite (has finite height and a finite number of vertices on every level). Consider an event E, i.e., some set of tree leaves. We cannot define the probability of an event under a given on-line probability distribution P. However, we can define an upper probability of E. (It may be called a "worst-case probability" if the event E is considered undesirable.) This notion can be defined in several (equivalent) ways.

Definition. (1) Consider all probability distributions on the leaves of the tree. Some of them are consistent with the given on-line probability distribution (i.e., give the same conditional probabilities for b_i when X_1, b_1, ..., X_i are given). The upper probability of E is the maximum of Pr[E] over all these distributions.

(2) Consider the following probabilistic game: a player ("adversary", if the event is undesirable) chooses some X_1, then b_1 is chosen at random with the prescribed probabilities (condition X_1), then the player chooses X_2, then b_2 is chosen at random (according to the conditional probabilities with condition X_1, b_1, X_2), etc. The player wins if the resulting leaf belongs to E. The upper probability of E is the maximal probability that the player wins (the maximum is taken over all deterministic strategies).

(3) Let us define the cost of a tree vertex inductively, starting from the leaves. For a leaf in E the cost is 1; for a leaf outside E the cost is 0. For a non-leaf vertex v where the choice of X_i is performed, the cost of v is the maximal cost of its sons; for a vertex that corresponds to the choice of b_i, the cost is the weighted sum of the sons' costs, where the weights are the conditional probabilities. The upper probability of E is the cost of the tree root.

(4) Let us consider on-line martingales with respect to P, i.e., non-negative functions V defined on tree vertices such that

V(X_1, b_1, ..., X_i, b_i) = V(X_1, b_1, ..., X_i, b_i, X_{i+1});   (1)

V(X_1, b_1, ..., X_i) = V(X_1, b_1, ..., X_i, 0) · Pr[b_i = 0 | X_1, b_1, ..., X_i] + V(X_1, b_1, ..., X_i, 1) · Pr[b_i = 1 | X_1, b_1, ..., X_i].   (2)

These functions correspond to the player's capital in a fair game: when the player observes X_i, the capital does not change; when the player splits the capital between bets on b_i = 0 and b_i = 1, the winning bet is rewarded according to the conditional probabilities determined by the on-line distribution. The upper probability of E is the minimal value of V(Λ) over all V such that V ≥ 1 on all leaves that belong to E. In other terms, the upper probability of E is 1 divided by the fair price of the option to play such a game with initial capital 1 knowing in advance that the sequence of outcomes belongs to E.

Remark. As we have mentioned, we need some precautions for the case when some values of P are zeros, since in this case the conditional probabilities are not uniquely defined. However, it is easy to see that all choices of conditional probabilities compatible with P lead to the same value of the upper probability.

Theorem 1. All four definitions are equivalent.

Proof. Note that a player's strategy in the second definition determines a distribution on the leaves (X_i is chosen deterministically according to the strategy, while b_i is chosen according to the prescribed conditional probabilities). This distribution is consistent with the given on-line distribution, so the upper probability as defined in (2) does not exceed the upper probability as defined in (1). On the other hand, any consistent probability distribution can be considered as a mixed strategy in the game (the player chooses her moves at random, using independent random bits), and the winning probability of a mixed strategy is a weighted average of the winning probabilities of pure strategies, so we get the reverse inequality. The inductive definition (3) computes the winning probability of the optimal strategy (induction on the tree height). The equivalence with the martingale definition can be proved in the same way as in the classical off-line setting (this argument goes back to Ville; see, e.g., [7]). If a martingale V starts with capital p and reaches 1 on every leaf in E, then, for every probability distribution compatible with P and for every tree vertex, the current value of V is an upper bound for the expected final value of V if the game starts at this vertex. Therefore, V(Λ) is an upper bound for the probability to end the game in E, for every probability distribution compatible with P. The reverse inequality: the vertex cost (defined inductively) satisfies the conditions in the definition of a martingale if we replace = by ≥ in condition (1). Increasing this function, we can get a martingale.

Remarks. 1. Note that upper probability is not additive: e.g., both an event and its complement can have upper probability 1; only the strategies that achieve them are different. However, it is sub-additive: the upper probability of A ∪ B does not exceed the sum of the upper probabilities of A and B.

2. We can define supermartingales in the same way as martingales, replacing = by ≥ in (2). We relax requirement (2) and not (1) since this is more natural from the game viewpoint: getting the information about X_i does not change the player's capital. It is easy to see that supermartingales may be used instead of martingales in the definition of upper probability.

3. Proving Theorem 1, we assumed that the tree is finite. However, the same argument shows that the theorem remains valid for infinite trees of finite height (and even for trees having no infinite branches), if we use suprema instead of maxima.
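To make definition (3) concrete, here is a small Python sketch (ours, not from the paper; the possible values of each X_i are restricted to a finite set so that the tree is finite) of the inductive cost computation. It also illustrates the asymmetry discussed in the introduction: under the fair-coin distribution the event "b_1 equals the first bit of X_1" has upper probability 1/2, while "b_1 equals the first bit of X_2" has upper probability 1, since the environment may choose X_2 after seeing b_1.

```python
def upper_probability(event, n_rounds, x_values, cond_prob):
    """Upper probability of an event, computed as the cost of the root
    (definition (3)): maximum over the environment's choices of X_i,
    expectation over the coin tosses b_i.

    event     : predicate on leaves, i.e. on tuples (X_1, b_1, ..., X_n, b_n)
    n_rounds  : number of rounds n
    x_values  : finite list of possible values of each X_i
    cond_prob : cond_prob(vertex, bit) = Pr[b_i = bit | X_1, b_1, ..., X_i]
    """
    def cost(vertex):
        if len(vertex) == 2 * n_rounds:          # leaf
            return 1.0 if event(vertex) else 0.0
        if len(vertex) % 2 == 0:                 # environment chooses the next X_i
            return max(cost(vertex + (x,)) for x in x_values)
        # coin toss: weighted sum over b_i = 0, 1
        return sum(cond_prob(vertex, b) * cost(vertex + (b,)) for b in (0, 1))
    return cost(())

fair = lambda vertex, bit: 0.5                   # the fair-coin on-line distribution

# b_1 equals the first bit of X_1: the coin is tossed after X_1 is fixed,
# so no strategy of the environment helps.
print(upper_probability(lambda leaf: leaf[1] == int(leaf[0][0]),
                        n_rounds=2, x_values=("0", "1"), cond_prob=fair))   # 0.5

# b_1 equals the first bit of X_2: the environment chooses X_2 after seeing b_1,
# so it can always succeed.
print(upper_probability(lambda leaf: leaf[1] == int(leaf[2][0]),
                        n_rounds=2, x_values=("0", "1"), cond_prob=fair))   # 1.0
```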
Classical probability theory says that events with very small probability can be safely ignored (and when they happen, we have to reconsider our assumptions about the probability distribution). In the on-line setting we can say the same about events that have very small upper probability: believing in the probabilistic assumption, we may safely ignore the possibilities that have negligible upper probabilities, and if such an event happens, we have to reconsider the assumption.

Remarks. 1. In fact, upper probability (though not under this name) is used in the definition of the Arthur-Merlin class in computational complexity theory, where a tree of polynomial height and a polynomially decidable event are considered and we distinguish between events of low and high upper probability.

2. It is easy to see that on-line martingales with respect to an on-line probability distribution P are exactly the ratios Q/P where Q is another on-line probability distribution. (Some evident precautions are needed if P can be zero somewhere.)

2 On-line Kolmogorov complexity KR

We can adapt the standard definition of Kolmogorov complexity (see, e.g., [3, 5] for the definition and a discussion of different versions of Kolmogorov complexity) to the on-line setting. Consider a sequence X_1, b_1, X_2, b_2, ..., X_n, b_n where the X_i are binary strings and the b_i are bits. Look for a shortest interactive program that, after getting input X_1, produces b_1, then after getting X_2 (in addition to X_1) produces b_2, then after getting X_3 produces b_3, etc. We call its length the on-line decision complexity with respect to the programming language π used and denote it by KR_π(X_1 → b_1; X_2 → b_2; ...; X_n → b_n). The reason for the name "decision complexity": if all X_i are empty, we get the standard notion of decision complexity of a bit string b_1 b_2 ... b_n (the length of the shortest program that generates b_i given i). It is easy to see that a natural version of the optimality theorem holds (there exists an optimal "programming language"), so the on-line decision complexity KR(X_1 → b_1; X_2 → b_2; ...; X_n → b_n), taken for an optimal programming language, is well defined up to an additive O(1) term.

Theorem 2. The on-line complexity KR(X_1 → b_1; ...; X_n → b_n) does not exceed the decision complexity KR(b_1 b_2 ... b_n) and is at least the conditional complexity KS(b_1 b_2 ... b_n | X_1, X_2, ..., X_n), up to O(1) terms.

In other terms, knowing the X_i in the on-line setting may help to describe b_1, ..., b_n, but knowing all X_i in advance is even better. (The proof is straightforward.)

3 On-line a priori probability and KA

It is well known that Kolmogorov complexity is related to a priori probability (the maximal lower semicomputable semimeasure). The latter can be naturally defined in the on-line setting. Let us give two equivalent definitions.

Consider an interactive probabilistic machine T that has an internal random bit generator. This machine gets some binary string X_1 (say, on a tape where the end of X_1 is marked by a special separator), performs a computation that uses X_1 and random bits, and may produce a bit b_1 (or hang). After b_1 is produced, T gets the second input string X_2, continues its work (using fresh random bits) and may produce a second output bit b_2, etc. In other words, we write X_1#X_2#...#X_n on the input tape, but T cannot get access to X_i before it has produced the i-1 output bits b_1, ..., b_{i-1}.
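As a toy illustration (ours; the coroutine interface is just one convenient way to model the interaction and is not taken from the paper), such an interactive probabilistic machine can be written as a Python generator that receives X_i, consumes fresh internal random bits and outputs b_i:

```python
import random

def toy_machine(rng=random):
    """A toy interactive probabilistic machine T: after receiving each input
    string X_i it outputs b_i = (parity of the number of ones in X_i) XOR
    (a fresh internal random bit)."""
    last_bit = None
    while True:
        x = yield last_bit                        # hand out the previous output bit, wait for the next X_i
        last_bit = (x.count("1") + rng.getrandbits(1)) % 2

# Driving the machine: feed X_1, X_2, ... one by one and collect b_1, b_2, ...
t = toy_machine()
next(t)                  # start the coroutine; nothing has been output yet
b1 = t.send("1010")      # T reads X_1 and produces b_1
b2 = t.send("111")       # only now T gets access to X_2 and produces b_2
print(b1, b2)
```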
For a given T consider the function M_T: let M_T(X_1, b_1, ..., X_n, b_n) be the probability that T outputs b_1, ..., b_n when given X_1, X_2, ..., X_n as input (with the restrictions described above). We extend the function M_T to sequences of odd length: M_T(X_1, b_1, ..., X_n, b_n, X_{n+1}) is equal to M_T(X_1, b_1, ..., X_n, b_n), and we let M_T(Λ) = 1. It is easy to see that if T never hangs (or hangs with zero probability), then M_T is an on-line probability distribution. In general, M_T is an on-line semimeasure in the sense of the following definition.

Definition. An on-line semimeasure is a function M that maps tree vertices to non-negative reals such that M(Λ) = 1, M(X_1, b_1, ..., X_i, b_i, X_{i+1}) = M(X_1, b_1, ..., X_i, b_i) (on vertices where no random choice is made, the function propagates without change), and the inequality

M(X_1, b_1, ..., X_i, b_i, X_{i+1}) ≥ M(X_1, b_1, ..., X_i, b_i, X_{i+1}, 0) + M(X_1, b_1, ..., X_i, b_i, X_{i+1}, 1)

holds. (We have replaced "=" by "≥" in the definition of an on-line probability distribution.)

It is easy to see that the semimeasure M_T that corresponds to a probabilistic machine T of the described type is a lower semicomputable function, i.e., there is an algorithm that gets its input and produces an increasing sequence of rational numbers converging to the value of the function.

Theorem 3. Every lower semicomputable on-line semimeasure corresponds to some probabilistic machine.

The proof is similar to the off-line case. For an on-line semimeasure M we perform a "memory allocation", so that to each finite sequence X_1, b_1, ... an open subset of [0, 1] of measure M(X_1, b_1, ...) is allocated. Adding X_i to the end of the sequence does not change the set; adding the bits 0 and 1 replaces the corresponding set by two of its disjoint subsets. (Note that these subsets may depend not only on b_i but also on X_i.) If M is lower semicomputable, these sets can be made uniformly effectively open. Then we consider a machine T that generates a uniformly distributed random real number α ∈ [0, 1] bit by bit and outputs T(X_1, b_1, ..., X_i) = b_i if the effectively open set that corresponds to X_1, b_1, ..., X_i, b_i contains α.

Theorem 4. There exists a largest (up to an O(1) factor) lower semicomputable on-line semimeasure.

Proof. Again we can use the standard trick: a universal machine first randomly generates a machine of the described type, in such a way that every machine appears with positive probability, and then simulates this machine.

We call this maximal semimeasure the on-line a priori probability and denote it by A(X_1, b_1, ..., X_n, b_n). (If all X_i are empty strings, we get the standard a priori probability on the binary tree.) Minus the logarithm of this semimeasure is called the on-line a priori complexity and is denoted by KA(X_1 → b_1; X_2 → b_2; ...; X_n → b_n).
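The only difference between an on-line semimeasure and an on-line probability distribution is the inequality sign in the last condition. The following sketch (ours, for functions given as tables on a finite tree) checks these conditions; with strict=True it checks the on-line probability distribution conditions instead.

```python
def is_online_semimeasure(m, strict=False, tol=1e-9):
    """Check the on-line semimeasure conditions for a function m on a finite tree.

    m is a dict mapping vertices -- tuples alternating strings X_i and bits b_i,
    with () for the root -- to non-negative reals.  The table must contain both
    bit-sons of every odd-length vertex for the strict check to be meaningful.
    """
    if abs(m.get((), 0.0) - 1.0) > tol:            # M(Lambda) = 1
        return False
    for v, value in m.items():
        if value < -tol:                           # non-negativity
            return False
        if len(v) % 2 == 1:                        # v ends with some X_i
            if abs(value - m[v[:-1]]) > tol:       # revealing X_i does not change M
                return False
            children = m.get(v + (0,), 0.0) + m.get(v + (1,), 0.0)
            if strict and abs(value - children) > tol:
                return False
            if children > value + tol:             # M(...,X_i) >= sum over b_i
                return False
    return True

# The fair-coin on-line distribution on the two-round tree with X_i in {"0", "1"}:
fair = {(): 1.0}
for x1 in "01":
    fair[(x1,)] = 1.0
    for b1 in (0, 1):
        fair[(x1, b1)] = 0.5
        for x2 in "01":
            fair[(x1, b1, x2)] = 0.5
            for b2 in (0, 1):
                fair[(x1, b1, x2, b2)] = 0.25

print(is_online_semimeasure(fair, strict=True))    # True
```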
4 Relations between KR and KA

Now that the two complexities KA and KR are defined in the on-line framework, one may ask how they are related. Their off-line versions are close to each other: it is known that KR(x) ≤ KA(x) ≤ KR(x) + 2 log KR(x) (up to O(1) terms) for all binary strings x. These inequalities remain true in the on-line setting (with the same O(1) precision):

Theorem 5. KR(...) ≤ KA(...) ≤ KR(...) + 2 log KR(...); here "..." stands for X_1 → b_1; X_2 → b_2; ...; X_n → b_n.

The proof of the second inequality goes in the same way as usual: we consider a randomized algorithm that picks a program of the optimal programming language at random, a program of length l being chosen with probability of order 2^{-l}/l^2, and simulates it interactively; the factor 1/l^2 is responsible for the 2 log KR(...) term.

The first inequality needs more care, since in the on-line case we are more restricted and need to ensure that the programs are indeed on-line and do not refer to inputs that are not yet available. We need to allocate the 2^n strings of length n (as programs) to objects that have KA-complexity less than n (i.e., have a priori probability greater than 2^{-n}). We do it inductively (first for X_1 → b_1, then for X_2 → b_2, etc.) and ensure a stronger requirement: if the a priori probability of some object exceeds k·2^{-n} for some k, then there are at least k different programs of length n allocated to this object. So we start looking at the approximations (from below) to the a priori probabilities of X_1 → 0 and X_1 → 1 (independently for each n and each X_1); when the probability of X_1 → b_1 exceeds k·2^{-n}, we allocate a new (k-th) program of length n that transforms X_1 into b_1. On top of this process we look at the approximations to the a priori probabilities of X_1 → b_1; X_2 → b_2 and add new programs that map X_2 to b_2 among the programs that already map X_1 to b_1; we have enough programs for that, since A(X_1, b_1, X_2, 1) + A(X_1, b_1, X_2, 0) ≤ A(X_1, b_1), so if k_1 programs are needed for the first term and k_0 for the second, then there are already k_0 + k_1 programs allocated to X_1 → b_1 to choose from. On top of that, we allocate programs for X_1 → b_1; X_2 → b_2; X_3 → b_3, etc.

5 On-line randomness

Let us return to the "real-life" example and make it less real: imagine that we observe an infinite sequence of games and (for every i) know the bit b_i produced by the referee when the i-th game starts and the string X_i that is known before the i-th game. There are cases when we would intuitively reject the fair-coin assumption. Can we make this intuition more formal and define the notion "in the sequence X_1, b_1, X_2, b_2, X_3, b_3, ... the bits b_1, b_2, ... are random"?

For the off-line case the most popular notion is Martin-Löf randomness (ML-randomness; see [3, 6] for details). Now we want to extend it to the on-line setting. Assume that a computable on-line probability distribution P (on the infinite tree) is fixed. The Martin-Löf definition starts with the notion of an "effectively null" set. Adapting this definition to the on-line setting, we need to remember that the probability of events is now undefined; moreover, the notion of upper probability (which replaces it) has been defined for the finite case only. Consider the space Π of all (infinite) sequences X_1, b_1, X_2, b_2, .... A cone in this space is the set of all sequences with a given finite prefix.

Definition. Let U be a finite union of cones. Then the upper probability of U with respect to P is defined as the upper probability of the corresponding event in a finite part of the tree (large enough to contain all the roots of the cones). (It is easy to see that this value does not change if we increase the size of the finite part of the tree. The upper probability is monotone with respect to set inclusion.)

Then we can define an on-line version of null sets.

Definition. A set Z ⊂ Π is an on-line null set if for any ε > 0 there exists a sequence of cones such that (1) the union of the cones covers Z; (2) the union of any finite number of these cones has upper probability less than ε.

The Martin-Löf definition of randomness deals with effectively null sets, so our next step is to define them in the on-line setting.
Definition. A set Z is an on-line effectively null set if there exists an algorithm that, for any given rational ε > 0, generates a sequence of vertices such that the corresponding cones cover Z and the union of any finite number of these cones has upper probability less than ε.

(Note that we require the upper probability of the union of the cones to be small, not the sum of the upper probabilities of the cones. This difference matters, since upper probability, unlike classical probability, is not additive.)

Theorem 6. There exists an on-line effectively null set that contains every other on-line effectively null set.

The proof is similar to the off-line case. Given any algorithm that, for a given rational ε > 0, generates a sequence of vertices, we can "trim" it so that the union of any finite number of the generated cones has upper probability less than ε. (Indeed, for a computable on-line distribution the upper probability of a finite union of cones is computable, so we may quarantine newly generated vertices until it becomes clear that releasing them keeps the upper probability below ε.) So we can enumerate all the algorithms that satisfy these restrictions and take the union of the corresponding on-line effectively null sets (combining covers of size ε/2, ε/4, etc. to get a cover of size ε; here we use the sub-additivity of upper probability).
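To make the trimming step more concrete, here is a schematic Python sketch (ours, not from the paper). The parameter upper_prob stands for the computable upper probability of a finite union of cones; the toy version used in the example below is only a stand-in.

```python
def trimmed(candidates, upper_prob, eps):
    """Release candidate cone roots only when it is clear that the union of all
    released cones keeps its upper probability below eps (the 'quarantine'
    step from the proof of Theorem 6)."""
    released, pending = [], []
    for v in candidates:
        pending.append(v)
        still_pending = []
        for w in pending:                          # re-examine quarantined vertices too
            if upper_prob(released + [w]) < eps:
                released.append(w)
                yield w
            else:
                still_pending.append(w)
        pending = still_pending

# Stand-in example: pretend every cone contributes 0.3 to the upper probability.
toy_upper_prob = lambda cones: min(1.0, 0.3 * len(cones))
print(list(trimmed(["v1", "v2", "v3", "v4"], toy_upper_prob, eps=1.0)))
# ['v1', 'v2', 'v3']
```

If the given algorithm already describes an on-line effectively null set, every candidate vertex is eventually released and the cover is unchanged; otherwise some vertices stay quarantined, but the trimmed algorithm still satisfies the restriction and can safely be included in the union.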

Publication date: 2008